The problem is that batch effect is typically assumed to affect gene expression at the individual-gene level. In other words, we typically apply adjustments, such as empirical Bayes, that control batch on a gene-by-gene basis.
Consider a set of samples from a study involving two batches. Let the expression data be distributed as multivariate normal (MVN) with mean vector \(\mu\) and covariance matrix \(\Sigma\).
Existing batch correction methods are designed to identify differentially expressed genes and thus focus on \(\mu\) and the diagonal of \(\Sigma\).
For network inference, however, we are less interested in these values and more interested in the off-diagonal of \(\Sigma\). Mechanisms of batch effect that act on the covariance structure, rather than the mean and variance structure, will not be corrected by existing batch correction methods.
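A minimal sketch (not from the original analysis) of why gene-level correction cannot remove a covariance-level batch effect: the two batches below already have identical means and variances and differ only in gene-gene correlation, so a location/scale adjustment is a no-op.

```r
library(MASS)
set.seed(1)
n <- 500
batchA <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2))
batchB <- mvrnorm(n, mu = c(0, 0), Sigma = diag(2))
# Per-batch location/scale correction (what gene-level methods target):
batchA_corrected <- scale(batchA)
batchB_corrected <- scale(batchB)
# The correlation difference survives the correction:
cor(batchA_corrected)[1, 2]  # still near 0.9
cor(batchB_corrected)[1, 2]  # still near 0
```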
library(sva)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-15. For overview type 'help("mgcv-package")'.
## Loading required package: genefilter
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:genefilter':
##
## area
library(ggplot2)
library(limma)
numSamples <- 1000
gene1 <- c(rnorm( numSamples,7),rnorm(numSamples,9))
gene2 <- c(rnorm(numSamples,7),rnorm(numSamples,9))
batch <- c(rep("A",numSamples),rep("B",numSamples))
df <- data.frame(gene1,gene2,batch)
ggplot(df, aes(x=gene1,y=gene2)) +geom_point(aes(col=batch),size=3, alpha=.5) +ggtitle("Raw data") +theme_bw()
correct <- c(rep(0,numSamples),rep(2,numSamples))
bc <- data.frame(gene1=gene1-correct,gene2=gene2-correct,batch)
ggplot(bc, aes(x=gene1,y=gene2)) +geom_point(aes(col=batch),size=3, alpha=.5) +ggtitle("Batch corrected data") +theme_bw()
This is batch correction working as intended.
But what if we can’t (or won’t) assume that batch affects only the mean and variance, and it also impacts covariance? This is critical for network inference.
genes <- mvrnorm(n = numSamples, c(7,7), matrix(c(1,.95,.95,1),nrow=2), tol = 1e-6, empirical = FALSE, EISPACK = FALSE)
gene1 <- c(genes[,1],rnorm(numSamples,9))
gene2 <- c(genes[,2],rnorm(numSamples,9))
df2 <- data.frame(gene1=gene1,gene2=gene2,batch)
ggplot(df2, aes(x=gene1,y=gene2)) +geom_point(aes(col=batch),size=3, alpha=.5) +ggtitle("Raw data") +theme_bw()
df2 <- data.frame(gene1=gene1-correct,gene2=gene2-correct,batch)
ggplot(df2, aes(x=gene1,y=gene2)) +geom_point(aes(col=batch),size=3, alpha=.5) +ggtitle("Batch corrected data") +theme_bw()
Existing methods do not address this “2nd order batch effect.”
numGenes <- 1000
nBatch1 <- numSamples
nBatch2 <- numSamples
mvnMean1 <- rexp(numGenes)
mvnMean2 <- rexp(numGenes)
mvnSigma1 <- 30*diag(rexp(numGenes))
mvnSigma2 <- 30*diag(rexp(numGenes))
batchCovariates <- c(rep("A",nBatch1),rep("B",nBatch2))
data1 <- t(mvrnorm(n = nBatch1, mvnMean1, mvnSigma1, tol = 1e-6, empirical = FALSE, EISPACK = FALSE))
data2 <- t(mvrnorm(n = nBatch2, mvnMean2, mvnSigma2, tol = 1e-6, empirical = FALSE, EISPACK = FALSE))
data <- cbind(data1,data2)
design <- model.matrix(~factor(batchCovariates))
diffExp <- ebayes(lmFit(data, design))
limmaPVals <- diffExp$p.value[,2]
qplot(-log(1:numGenes/numGenes),-log(sort(limmaPVals))) + geom_abline(intercept=0,slope=1) +
ggtitle("QQ plot for uncorrected batched expression data") + xlab("-log(Expected)") + ylab("-log(Observed)")
combatData <- ComBat(data, batchCovariates, prior.plots=T)
## Found 2 batches
## Adjusting for 0 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
combatDiffExp <- ebayes(lmFit(combatData, design))
combatLimmaPVals <- combatDiffExp$p.value[,2]
qplot(-log(1:numGenes/numGenes),-log(sort(combatLimmaPVals))) + geom_abline(intercept=0,slope=1) +
ggtitle("QQ plot for batched expression data") + xlab("-log(Expected)") + ylab("-log(Observed)")
So, in this example, we see that use of empirical Bayes correction allows for the control of false positives in differential expression analysis. Does it also control false positives in network inference?
library(nettools)
## Loading required package: DBI
##
wgcna <- mat2adj(t(data),method="WGCNA")
allR <- wgcna[row(wgcna)>col(wgcna)]
tStats <- sort((allR^(1/6))*sqrt(ncol(data)-2))
qplot(sort(abs(qt(1:length(tStats)/length(tStats),df=ncol(data)-2))), tStats) + geom_abline(intercept=0,slope=1) +
ggtitle("QQ plot for raw data WGCNA") + xlab("Expected |t|") + ylab("Observed |t|")
We see that batch effect clearly leads to false positives in the absence of correction.
combatwgcna <- mat2adj(t(combatData),method="WGCNA")
allR <- combatwgcna[row(combatwgcna)>col(combatwgcna)]
tStats <- sort((allR^(1/6))*sqrt(ncol(combatData)-2))
qplot(sort(abs(qt(1:length(tStats)/length(tStats),df=ncol(combatData)-2))), tStats) + geom_abline(intercept=0,slope=1) +
ggtitle("QQ plot for corrected data WGCNA") + xlab("Expected |t|") + ylab("Observed |t|")
The use of ComBat does a good job of limiting false positives in this simple case.
But… What if the batch effect impacts the covariance structure rather than the mean structure?
SigmaVec <- rep(0,numGenes^2)
SigmaVec[sample(numGenes^2,10000)] <- rbeta(10000,1,3)
mvnSigma1 <- matrix(SigmaVec, ncol=numGenes)
mvnSigma1 <- mvnSigma1 + t(mvnSigma1)
SigmaVec <- rep(0,numGenes^2)
SigmaVec[sample(numGenes^2,10000)] <- rbeta(10000,1,3)
mvnSigma2 <- matrix(SigmaVec, ncol=numGenes)
mvnSigma2 <- mvnSigma2 + t(mvnSigma2)
mvnMean1 <- rexp(numGenes)
mvnMean2 <- rexp(numGenes)
diag(mvnSigma1) <- colSums(mvnSigma1)
diag(mvnSigma2) <- colSums(mvnSigma2)
batchCovariates <- c(rep("A",nBatch1),rep("B",nBatch2))
data1 <- t(mvrnorm(n = nBatch1, mvnMean1, mvnSigma1, tol = 1e-6, empirical = FALSE, EISPACK = FALSE))
data2 <- t(mvrnorm(n = nBatch2, mvnMean2, mvnSigma2, tol = 1e-6, empirical = FALSE, EISPACK = FALSE))
data <- cbind(data1,data2)
design <- model.matrix(~factor(batchCovariates))
diffExp <- ebayes(lmFit(data, design))
limmaPVals <- diffExp$p.value[,2]
qplot(-log(1:numGenes/numGenes),-log(sort(limmaPVals))) + geom_abline(intercept=0,slope=1) +
ggtitle("QQ plot for uncorrected correlation-batched expression data") + xlab("-log(Expected)") + ylab("-log(Observed)")
combatData <- ComBat(data, batchCovariates, prior.plots=T)
## Found 2 batches
## Adjusting for 0 covariate(s) or covariate level(s)
## Standardizing Data across genes
## Fitting L/S model and finding priors
## Finding parametric adjustments
## Adjusting the Data
combatDiffExp <- ebayes(lmFit(combatData, design))
combatLimmaPVals <- combatDiffExp$p.value[,2]
qplot(-log(1:numGenes/numGenes),-log(sort(combatLimmaPVals))) + geom_abline(intercept=0,slope=1) +
ggtitle("QQ plot for corrected correlation-batched expression data") + xlab("-log(Expected)") + ylab("-log(Observed)")
Again, for differential expression, we see that conventional batch correction methods do a satisfactory job of controlling false positives.
Now we look at how “2nd order batch effects” affect network inference.
datawgcna <- mat2adj(t(data),method="WGCNA")
allR <- datawgcna[row(datawgcna)>col(datawgcna)]
tStats <- sort((allR^(1/6))*sqrt(ncol(data)-2))
qplot(sort(abs(qt(1:length(tStats)/length(tStats),df=ncol(data)-2))), tStats) + geom_abline(intercept=0,slope=1) +
ggtitle("QQ plot for uncorrected correlation-batched data WGCNA") + xlab("Expected |t|") + ylab("Observed |t|")
GTEx uses WGCNA to find common modules across tissues.
The GTEx Consortium attempted to adjust for batch in the following ways.
Study design:
To the extent possible, based on sample availability, batches for library construction were designed to include a range of samples from different tissues and to span multiple donors, so as to minimise donor and tissue batch effects.
Expression correction:
the effects of the top 3 PEER factors, gender, and 3 genotype PCs were removed.
Neither of these addresses the “second order” batch issue.
(Figure: WGCNA co-expression networks and module annotation: module bipartite graphs, Blood vs Lung, for Centers A, B, and C.)
Conventional batch correction model: \[Y_g=\alpha_g+\beta_gX+\gamma_{ig}Z+\delta_{ig}\epsilon_{ig}\]
where X is the exposure (e.g. treatment/control) and Z is the batch (or other covariates). In the context of network inference, we often want to find \(cor(Y_{g1},Y_{g2})\), independent of \(Z\).
So, in order to model 2nd order batch, what we really want to do is allow the parameter of interest, \(\beta_g\), to vary by batch.
So, now we set
\[\beta_g^*=\beta_g+\beta_{Bg}Z\]
where \(\beta_{Bg}\) is a new parameter that we need to estimate for each of the \({{p}\choose{2}}\) comparisons.
We can write out a full model for any two genes. Note that \(Y_{g2}\) is another gene in this model.
\[Y_{g2}=\alpha_g+\beta_g^*X+\gamma_{ig}Z+\delta_{ig}\epsilon_{ig}\]
or
\[Y_{g2}=\alpha_g+(\beta_g+\beta_{Bg}Z)X+\gamma_{ig}Z+\delta_{ig}\epsilon_{ig}\] \[Y_{g2}=\alpha_g+\beta_gX+\beta_{Bg}ZX+\gamma_{ig}Z+\delta_{ig}\epsilon_{ig}\]
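The interaction model above can be illustrated with a toy simulation (assumed data, not the author's): generate \(Y_{g2}\) with a slope on \(X\) that differs by batch \(Z\), then recover both slopes with `lm()`.

```r
set.seed(2)
n <- 1000
Z <- rep(c(0, 1), each = n / 2)  # batch indicator
X <- rnorm(n)                    # predictor (the exposure in the model)
Yg2 <- 0.5 * X + 0.4 * Z * X + rnorm(n)
fit <- lm(Yg2 ~ X + Z + X:Z)
coef(fit)["X"]    # estimate of beta_g  (should be near 0.5)
coef(fit)["X:Z"]  # estimate of beta_Bg (should be near 0.4)
```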
There are many ways to approach this, but I believe the best is the following sequence of steps:
Apply conventional batch correction. This effectively eliminates the \(\gamma_{ig}Z\) term, so we can proceed with the simpler model \[Y_{g2}=\alpha_g+\beta_gX+\beta_{Bg}ZX+\delta_{ig}\epsilon_{ig}\] on the ComBat-corrected data. Then standardize each gene's expression (this will not change the actual results, but it aids interpretation and computation time).
Fit the following models. Reduced: \[Y_{g2}=\alpha_g+\beta_gX+\delta_{ig}\epsilon_{ig}\] Full: \[Y_{g2}=\alpha_g+\beta_gX+\beta_{Bg}ZX+\delta_{ig}\epsilon_{ig}\]
Place the estimated coefficients into two separate matrices (\(S_\beta,S_B\)). There are many options for computing these coefficients; a LASSO-style L1 regularization would probably make the most sense here, but for simplicity we start with OLS.
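A toy sketch of these last two steps under OLS (assumed helper code, not the author's implementation): fit the full model for every gene pair and store the coefficients in \(S_\beta\) and \(S_B\).

```r
set.seed(3)
p <- 5; n <- 200
Z <- rep(c(0, 1), each = n / 2)
expr <- matrix(rnorm(p * n), nrow = p)          # rows = standardized genes
S_beta <- matrix(0, p, p)
S_B    <- matrix(0, p, p)
for (i in 1:(p - 1)) for (j in (i + 1):p) {
  fit <- lm(expr[j, ] ~ expr[i, ] + expr[i, ]:Z)
  S_beta[i, j] <- S_beta[j, i] <- coef(fit)[2]  # beta_g: batch-independent similarity
  S_B[i, j]    <- S_B[j, i]    <- coef(fit)[3]  # beta_Bg: batch impact
}
```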
So, now we have two separate (equal sized) matrices instead of the usual one. \(S_\beta\) is the estimated similarity matrix and \(S_B\) is the “batch impact.” Intuitively, the expected value of \(S_B\) is a zero matrix in the absence of 2nd order batch effects. This lends itself easily to 2nd order batch effect testing - for example, we can compare the two models via likelihood ratio test (LRT). This is nice, but we’re much more interested in 2nd order batch effect correction.
This yields a similarity matrix that is batch-independent. In other words, we can now compare networks computed with different proportions of batch membership. We can think of the adjusted similarity matrix as being the estimated similarity matrix given a standardized representation of batches. This standardization allows us to compare networks which have been inferred with differing batch composition.
Obviously, the usual caveats apply - this correction is most useful when the batch proportions in each exposure group are (a) unequal, but (b) not too unequal. Small numbers of samples per batch will result in wild fluctuations in the estimated batch effect.
(Figure: Gene 1; BLUE = naive, BLACK = batch baseline, RED = batch difference.)
Given the difficulties of validating networks, validating that batch has been properly controlled will be a challenge. Rather than compare our results to an unreliable, expensive, or non-existent gold-standard benchmark, we can instead focus on network “stability.” In theory, our approach should control for batch and thus be independent of the batch distribution. We can examine this by sampling differing proportions from batches and observing how the inferred network varies as a function of batch proportion.
This approach uses the following procedure for a set of \(n\) total samples from two distinct batches:
The above procedure is performed for both standard uncorrected network inference and also for our corrected version.
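A hedged sketch of such a stability check (the function name and deviation metric are mine): draw subsamples with varying batch-A proportion and measure how far each subsample's correlation network drifts from the full-data network.

```r
stabilityCurve <- function(expr, batch, nSub = 100, props = seq(0.1, 0.9, by = 0.2)) {
  ref <- cor(t(expr))                      # full-data network as reference
  sapply(props, function(pA) {
    idxA <- sample(which(batch == "A"), round(pA * nSub))
    idxB <- sample(which(batch == "B"), nSub - round(pA * nSub))
    sub <- cor(t(expr[, c(idxA, idxB)]))
    mean(abs(sub - ref))                   # average edge-weight deviation
  })
}
```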
2nd order batch effect in GTEx clearly exists (see above), but it can be subtle. For the purposes of a proof of concept, we can generate a much stronger batch effect that clearly demonstrates this approach. We simply select 5000 genes in one batch and a separate set of 5000 genes in the other batch, relabelling both sets Gene1, Gene2,…,Gene5000. Basically, we’re mismatching the genes between the batches! This, obviously, causes a completely random rewiring of the network from one batch to the other.
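This stress test can be sketched as follows (`exprB` is a hypothetical batch-B expression matrix with genes in rows): permuting batch B's gene labels guarantees the two batches share no network structure by construction.

```r
set.seed(4)
mismatchGenes <- function(exprB) {
  shuffled <- exprB[sample(nrow(exprB)), ]  # shuffle the gene rows
  rownames(shuffled) <- rownames(exprB)     # relabel as Gene1, Gene2, ...
  shuffled
}
```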
Benchmark taken as the set of edges with weight > 0.1, which is roughly 38% of all edges.
Confounder used: EBV-transformed lymphocytes vs. Whole Blood. This data comprises 245 samples: 54 cell lines and 191 Whole Blood, processed by Joe Paulson. There are 24369 genes, but I subset to the 2000 genes with highest variance (just to reduce computational burden).
Benchmark taken as the set of edges with weight > 0.1, which is roughly 44% of all edges.
This allows us to evaluate the performance of alterations to the model. We can impose sparsity constraints, shrinkage, or a Bayesian prior on the magnitude of the batch effect.
Work in progress -
Used Eclipse data.
Summary: In this analysis, we perform a very common workflow: build coexpression networks for COPD vs. smoker controls and identify consensus modules with WGCNA. Gender is treated as a confounder, but is corrected only using standard batch correction methods…
eclipseExp <- read.table("~/gd/Harvard/Research/data/Eclipse/ECLIPSE_Blood_Exp.txt",row.names=1,header=T)
eclipseClinical <- read.table("~/gd/Harvard/Research/data/Eclipse/ECLIPSE_blood.txt",header=T,fill = TRUE, sep="\t",row.names=1)
Data: 220 samples with 146 males and 74 females; 136 COPD and 84 Smoker Controls (see table below).
table(eclipseClinical$Subject.type,eclipseClinical$Gender)[c(1,3),]
##
## F M
## COPD 45 91
## Smoker Control 29 55
While this data is reasonably balanced, we can use it to demonstrate how sensitive our results are to confounding that was supposedly accounted for already.
cases <- eclipseClinical$Subject.type=="COPD"
controls <- eclipseClinical$Subject.type=="Smoker Control"
males <- eclipseClinical$Gender=="M"
# dataA <- eclipseExp[10101:12100,c(T,F)]
# dataB <- eclipseExp[10101:12100,c(F,T,F,F)]
dataA <- eclipseExp[101:2100, cases&(!males)]
dataB <- eclipseExp[101:2100, controls&(males)]
library(WGCNA)  # fixDataStructure() and blockwiseConsensusModules() come from WGCNA
fixedDataA <- fixDataStructure(t(dataA))
fixedDataB <- fixDataStructure(t(dataB))
fixedDataA_BCM <- blockwiseConsensusModules(fixedDataA, power = 6, minModuleSize = 30, deepSplit = 2, maxBlockSize=30000,
pamRespectsDendro = FALSE,
mergeCutHeight = 0.25, numericLabels = TRUE,
minKMEtoStay = 0,
saveTOMs = TRUE, verbose = 5)
# adjMat <- adjacency(t(dataA), power = 6)
# TOM <- TOMsimilarity(adjMat, TOMDenom = 'min', verbose = 1)
# consensusNetwork <- consensusDissTOMandTree(fixedDataA, 6, TOM = TOM)
fixedDataB_BCM <- blockwiseConsensusModules(fixedDataB, power = 6, minModuleSize = 30, deepSplit = 2, maxBlockSize=30000,
pamRespectsDendro = FALSE,
mergeCutHeight = 0.25, numericLabels = TRUE,
minKMEtoStay = 0,
saveTOMs = TRUE, verbose = 5)
# summary(lm(fixedDataA_BCM$colors==0 ~ as.factor(fixedDataB_BCM$colors)))
# vglmFitMN <- vglm(as.factor(lungBCM$colors) ~ sample(as.factor(bloodBCM$colors)), family=multinomial(refLevel=1))
# pseudo-R^2
pseudoMultiR <- function(groupsA, groupsB, method="nagelkerke"){
require(VGAM)
vglmFitMN <- vglm(as.factor(groupsA) ~ as.factor(groupsB), family=multinomial(refLevel=1))
vglm0 <- vglm(as.factor(groupsA) ~ 1, family=multinomial(refLevel=1))
LLf <- VGAM::logLik(vglmFitMN)
LL0 <- VGAM::logLik(vglm0)
N <- length(groupsA)
if(method=="mcfadden"){
as.vector(1 - (LLf / LL0))
} else if (method=="coxsnell"){
as.vector(1 - exp((2/N) * (LL0 - LLf)))
} else if (method=="nagelkerke"){
as.vector((1 - exp((2/N) * (LL0 - LLf))) / (1 - exp(LL0)^(2/N)))
}
}
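Hypothetical usage of `pseudoMultiR()` above (module labels here are random, so the pseudo-\(R^2\) should be near zero; assumes VGAM is installed):

```r
set.seed(5)
groupsA <- sample(1:4, 200, replace = TRUE)
groupsB <- sample(1:4, 200, replace = TRUE)
pseudoMultiR(groupsA, groupsB, method = "nagelkerke")
```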
Agreement vs Balance
This is great! There is a very clear dependence of the agreement between cases and controls on the balance. In other words, with a strong lack of balance we see weaker agreement, indicating that the results we DO see are a function of the confounder rather than the case-control partition.
And now the less awesome news…
Corrected Agreement vs Balance
The current correction implementation does not appear to help at the moment.
Likely reason #1:
The approach is too flexible for 2nd order effects.
Solutions:
Likely reason #2: 2nd order effects are not as strong as they were in the demo above (reminder: we used a completely different network for Batch A compared to Batch B).
Solutions:
library(WGCNA)
dissTOMCases <- TOMdist(abs(cor(t(dataset$exp[c(T,F),casesFilter]),use="p"))^6)
dissTOMControls <- TOMdist(abs(cor(t(dataset$exp[c(T,F),controlsFilter]),use="p"))^6)
copdgeneDif <- dissTOMCases - dissTOMControls
rownames(copdgeneDif) <- rownames(dataset$exp)[c(T,F)]
colnames(copdgeneDif) <- rownames(dataset$exp)[c(T,F)]
load("~/NI_only_0001/ECLIPSE_JASPAR2014_MONSTER.RData")
dissTOMCases <- TOMdist(abs(cor(t(dataset$exp[c(T,F),casesFilter]),use="p"))^6)
dissTOMControls <- TOMdist(abs(cor(t(dataset$exp[c(T,F),controlsFilter]),use="p"))^6)
eclipseDif <- dissTOMCases - dissTOMControls
rownames(eclipseDif) <- rownames(dataset$exp)[c(T,F)]
colnames(eclipseDif) <- rownames(dataset$exp)[c(T,F)]
intersectGenes <- intersect(rownames(eclipseDif),rownames(copdgeneDif))
eclipseDif <- eclipseDif[intersectGenes,intersectGenes]
copdgeneDif <- copdgeneDif[intersectGenes,intersectGenes]
library(ggplot2)
df <- data.frame(copdgene=c(copdgeneDif[1:100,1:100]),eclipse=c(eclipseDif[1:100,1:100]))
ggplot(df,aes(x=copdgene,y=eclipse)) + geom_point(alpha=.1) + theme_bw() +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Model: let \(X\) be our gene expression matrix, containing \(p\) rows and \(n\) columns, corresponding to genes and samples, respectively.
Let \(X_{g1},X_{g2}\) be two genes within that matrix, and let \(Z\) be a covariate on which we want to apply batch correction.
ComBat batch correction and standard normalization methods are applied first. Next, each row is standardized, such that \[\bar{X}_{g1} = \bar{X}_{g2}=0\]
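The row standardization can be written in one line (a minimal sketch, assuming `combatData` holds the ComBat-corrected expression with genes in rows):

```r
stdData <- t(scale(t(combatData)))  # each gene now has mean 0, sd 1
```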
We can fit the model:
\[X_{g2}=\beta_0+\beta_1X_{g1}+\beta_2Z + \epsilon_{g2}\]
And then immediately ignore the intercept term (because of the standardization)
So we have:
\[E[X_{g2}]=\beta_{1,g1,g2}X_{g1}+\beta_{2,g1,g2}Z\]
for
\[g1,g2\in {1,2,...p}\]
with constraints
\[\sum_{i=1}^{p}\sum_{j=i+1}^{p}\beta_{2,i,j}^2<\lambda\]
We impose no such restriction on the \(\beta_1\)s.
Each pair can be solved individually via the ridge data-augmentation trick: \[X^*=\begin{pmatrix}X_{g1} & Z\\ 0 & \sqrt{\lambda}\end{pmatrix},\qquad X_{g2}^*=\begin{pmatrix}X_{g2}\\ 0\end{pmatrix}\] \[\hat{\beta}=(X^{*T}X^{*})^{-1}X^{*T}X_{g2}^{*}\]
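A per-pair sketch of this augmentation (my reconstruction of the intent, not the author's code): the single appended pseudo-row penalizes only the \(Z\) coefficient.

```r
ridgePairFit <- function(x1, x2, Z, lambda) {
  Xstar <- rbind(cbind(x1, Z), c(0, sqrt(lambda)))  # augmented design matrix
  ystar <- c(x2, 0)                                 # augmented response
  solve(crossprod(Xstar), crossprod(Xstar, ystar))  # returns (beta_1, beta_2)
}
```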
But this is completely infeasible at high-throughput scale, where we must compute \({p \choose 2}\) models.
I developed a workaround that involves computing the hat matrix for each gene separately, storing those \(p\) matrices, and computing the associations from there. This requires a lot of memory - since we are storing \(p \cdot n^2\) values - but allows the algorithm to run approximately \(p/2\) times faster.
However, this may be moot, as I have discovered that the sparsity constraint may not be the best idea. Here’s why:
L2 regularization shrinks all estimates toward zero. However, for any constraint, we essentially “allow” a certain amount of batch to be estimated. This results in a similar amount of batch being controlled for in all scenarios, regardless of whether batch actually exists in the data (bad!). We will overfit to batch, and it will be very hard to identify when we are doing so.
Inverting a \(20,000\times 20,000\) matrix is also not a practical approach (though within reason on a cluster). I will shelve this approach unless I cannot come up with an alternative…
Controlling for confounding in network inference. Substantial progress has been made in this project in the following areas:
Initial analyses have proven challenging to demonstrate in real data with real batch effects,